Task 1 - 2 Using the Odds Data to Obtain Information About the Game Outcomes

Objective

The objective of the project is to determine whether the odds data from multiple bookmakers carry any significant information about the game outcomes. PCA and MDS (with Euclidean and Manhattan distances) are used.

To apply PCA and MDS, 5 bookmakers are selected:

1) SBOBET

2) Tipico

3) bwin

4) 12BET

5) Unibet

Task 3 Compressing Images Using PCA

Objective

The objective is to perform several operations on an image: displaying its RGB channels, adding random noise, extracting patches, applying PCA to those patches and plotting the images formed by the PCA eigenvectors. By applying PCA, the image will be compressed.

Task 1.a - 2

In this part, 5 bookmakers are selected to check whether the over/under 2.5 game result can be explained by the odds for different types of bets. PCA is used for that purpose.

To handle the large amount of data, manipulate it and plot the insights, the data.table, anytime, ggbiplot, devtools and plotly packages are loaded.

library(anytime)
library(data.table)
library(ggbiplot)
library(devtools)
library(plotly)

To read data, readRDS function is used.

matches_datapath='Desktop/df9b1196-e3cf-4cc7-9159-f236fe738215_matches-2.rds'
odd_details_datapath='Desktop/df9b1196-e3cf-4cc7-9159-f236fe738215_odd_details.rds'
matches=data.table(readRDS(matches_datapath))
odds=readRDS(odd_details_datapath)

Since we are asked to inspect the over/under and 1x2 outcomes, the match result must be quantified. Total goals are calculated to determine whether the game result is under or over 2.5, and the home and away goals are compared to determine the game result (Home wins, Away wins or Tie).

To mark matches as under or over, an additional IsOver column is added. For the home, away and tie results, a Result column is added.

matches[,c("HomeGoals","AwayGoals"):=tstrsplit(score,':')]
matches$HomeGoals=as.numeric(matches$HomeGoals)
matches[,AwayGoals:=as.numeric(AwayGoals)]
matches[,TotalGoals:=HomeGoals+AwayGoals]
matches[,IsOver:="Under"]
matches[TotalGoals>2,IsOver:="Over"]
matches[,match_time:=anytime(date)]
matches[,Year:=year(match_time)]
matches[,Result:="Tie"]
matches[HomeGoals>AwayGoals,Result:="Home"]
matches[AwayGoals>HomeGoals,Result:="Away"]
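
The labeling rules above can be checked on a few toy scores. The following is a minimal base-R sketch, not the data.table code used in this report; the helper name label_match and the toy scores are hypothetical.

```r
# Hypothetical helper: derive IsOver and Result from "home:away" score strings.
label_match <- function(score) {
  goals <- do.call(rbind, strsplit(score, ":", fixed = TRUE))
  home <- as.numeric(goals[, 1])
  away <- as.numeric(goals[, 2])
  data.frame(
    score  = score,
    # With integer goal counts, "more than 2" is the same as "over 2.5".
    IsOver = ifelse(home + away > 2, "Over", "Under"),
    Result = ifelse(home > away, "Home", ifelse(away > home, "Away", "Tie")),
    stringsAsFactors = FALSE
  )
}

toy <- label_match(c("2:0", "1:2", "1:1", "0:0"))
print(toy)
```

The boundary case is a match with exactly two goals, which is labeled Under.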

Since we are only interested in SBOBET, Tipico, bwin, 12BET and Unibet, the odds data table is filtered to these bookmakers (-1). Moreover, to increase reliability, only the final odds for each game are included, and the bet types “Home, Away or Tie” (1x2), “Over or Under” (ou), “Both Teams Score” (bts) and “Double Chance” (dc) are used with the 2.5, 0.5 and no-handicap options (-2). The markers (-1) and (-2) refer to the commented lines in the code.

odds=odds[bookmaker %in% c('SBOBET','Tipico', 'bwin', '12BET', 'Unibet')] # (-1 )

odds_final = odds [betType %in% c('1x2','ou', 'bts', 'dc'),list(last_odd = odd[.N]),by=list(matchId,bookmaker,betType,oddtype,totalhandicap)]
odds_final=odds_final[totalhandicap %in% c(2.5,NA,0.5) ] # (-2)
odds_final=odds_final[order(matchId,bookmaker,betType,oddtype,totalhandicap)]

The matches dataset must be merged with the odds dataset to combine the odds with the game results. Since the two datasets have different structures, the odds are first reshaped to wide format with dcast, and the two datasets are then merged by matchId, their common attribute.

odds_final_wide=dcast(odds_final, matchId ~bookmaker+betType + oddtype + totalhandicap, value.var='last_odd')

odds_final_wide = odds_final_wide[complete.cases(odds_final_wide)]
merge_final_odds_match = merge(odds_final_wide,matches,by = 'matchId')
merge_final_odds_match = merge_final_odds_match[order(matchId)]
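
The reshaping step can be illustrated on a tiny table. This is a sketch under stated assumptions: the bookmaker names "A" and "B", the match ids and the odds values are made up, and only the column names follow the report.

```r
library(data.table)

# Toy long-format odds (made-up values): match m3 has no odds from bookmaker "B".
long <- data.table(
  matchId       = c("m1", "m1", "m2", "m2", "m3"),
  bookmaker     = c("A", "B", "A", "B", "A"),
  betType       = "1x2",
  oddtype       = "odd1",
  totalhandicap = NA_real_,
  last_odd      = c(1.8, 1.9, 2.4, 2.5, 3.1)
)

# One row per match, one column per bookmaker / bet type / odd type / handicap.
wide <- dcast(long, matchId ~ bookmaker + betType + oddtype + totalhandicap,
              value.var = "last_odd")

# Keep only matches for which every column has a value.
wide_complete <- wide[complete.cases(wide)]
print(wide_complete)
```

Match m3 is dropped by complete.cases because one bookmaker did not quote it, which is why the merged table can contain fewer matches than the raw data.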

PCA Method

Since the odds in each column have different ranges, the dataset is standardized with the scale function before applying PCA. The matchId column is excluded while scaling, since the scale and PCA functions require numerical data. Matches are instead identified by row number, since all datasets were previously ordered by matchId:

scaled_merged_final_odds_match = scale(merge_final_odds_match[,c(2:48)])

PCA is applied by using princomp function.

final_odds_PCA_scaled = princomp(scaled_merged_final_odds_match,cor = TRUE)

To evaluate the components, we need to get the summary of PCA.

summary(final_odds_PCA_scaled)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     4.5357220 3.8650427 2.8670447 0.77388858
## Proportion of Variance 0.4377186 0.3178416 0.1748924 0.01274263
## Cumulative Proportion  0.4377186 0.7555602 0.9304526 0.94319527
##                             Comp.5      Comp.6      Comp.7      Comp.8
## Standard deviation     0.647199982 0.598863389 0.531801908 0.485126064
## Proportion of Variance 0.008912081 0.007630582 0.006017304 0.005007389
## Cumulative Proportion  0.952107353 0.959737935 0.965755238 0.970762628
##                             Comp.9     Comp.10     Comp.11     Comp.12
## Standard deviation     0.429360041 0.426648328 0.360336999 0.314516485
## Proportion of Variance 0.003922341 0.003872953 0.002762612 0.002104694
## Cumulative Proportion  0.974684969 0.978557922 0.981320534 0.983425228
##                            Comp.13     Comp.14     Comp.15     Comp.16
## Standard deviation     0.287350482 0.278519854 0.261739468 0.257836957
## Proportion of Variance 0.001756815 0.001650496 0.001457607 0.001414466
## Cumulative Proportion  0.985182043 0.986832539 0.988290146 0.989704612
##                          Comp.17    Comp.18      Comp.19      Comp.20
## Standard deviation     0.2258832 0.22189880 0.2063181783 0.1902724705
## Proportion of Variance 0.0010856 0.00104764 0.0009056849 0.0007702896
## Cumulative Proportion  0.9907902 0.99183785 0.9927435373 0.9935138270
##                             Comp.21      Comp.22      Comp.23      Comp.24
## Standard deviation     0.1669100415 0.1602611145 0.1517401104 0.1482751887
## Proportion of Variance 0.0005927439 0.0005464601 0.0004898949 0.0004677773
## Cumulative Proportion  0.9941065708 0.9946530309 0.9951429258 0.9956107031
##                             Comp.25      Comp.26      Comp.27      Comp.28
## Standard deviation     0.1431548489 0.1340652265 0.1287795565 0.1265330933
## Proportion of Variance 0.0004360279 0.0003824146 0.0003528548 0.0003406516
## Cumulative Proportion  0.9960467310 0.9964291456 0.9967820003 0.9971226519
##                             Comp.29      Comp.30      Comp.31      Comp.32
## Standard deviation     0.1195167453 0.1134467265 0.1114612819 0.1056468792
## Proportion of Variance 0.0003039203 0.0002738332 0.0002643323 0.0002374737
## Cumulative Proportion  0.9974265722 0.9977004054 0.9979647376 0.9982022113
##                            Comp.33      Comp.34      Comp.35      Comp.36
## Standard deviation     0.099698045 0.0962413280 0.0915242606 0.0858710355
## Proportion of Variance 0.000211483 0.0001970722 0.0001782275 0.0001568901
## Cumulative Proportion  0.998413694 0.9986107665 0.9987889940 0.9989458841
##                             Comp.37      Comp.38      Comp.39      Comp.40
## Standard deviation     0.0841944192 0.0827945964 0.0793298921 0.0731559146
## Proportion of Variance 0.0001508234 0.0001458499 0.0001338985 0.0001138678
## Cumulative Proportion  0.9990967075 0.9992425574 0.9993764559 0.9994903237
##                             Comp.41      Comp.42      Comp.43      Comp.44
## Standard deviation     0.0716766747 0.0691029653 6.435270e-02 5.960248e-02
## Proportion of Variance 0.0001093095 0.0001016004 8.811214e-05 7.558417e-05
## Cumulative Proportion  0.9995996332 0.9997012336 9.997893e-01 9.998649e-01
##                             Comp.45      Comp.46      Comp.47
## Standard deviation     5.441385e-02 0.0482453990 3.255467e-02
## Proportion of Variance 6.299717e-05 0.0000495238 2.254907e-05
## Cumulative Proportion  9.999279e-01 0.9999774509 1.000000e+00

In PCA, the goal is to cover as much of the variance as possible with few components. As the output above shows, the first principal component covers about 44% of the variance and the second about 32%, which adds up to about 76%. Including the third component raises the covered variance to around 93%, which is more favorable. As a result, our PCA graph should be in 3 dimensions. A fourth dimension would add only about 1% to the variance coverage, so it is not needed.

Another way to decide the dimension of the PCA graph is to inspect a scree plot of the component variances. The major fractions of the variance coverage are in the first, second and third components.
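
The proportions in the summary can also be computed directly from the component standard deviations. The sketch below uses the built-in USArrests data as a stand-in for the odds table, and the 90% threshold is an assumption chosen for illustration.

```r
# Stand-in data: USArrests replaces the scaled odds table of this report.
pca <- princomp(USArrests, cor = TRUE)

# Each component's variance is its squared standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 90%.
n_keep <- which(cum_var >= 0.90)[1]

print(round(cum_var, 3))
print(n_keep)
```

Calling screeplot(pca, type = "lines") on the same object draws the corresponding scree plot.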

To evaluate the eigenvectors, the loadings of the first three components (PC1, PC2 and PC3) are stored as a separate object and sorted by each component in turn:

eigenve<-loadings(final_odds_PCA_scaled)[,1:3]
eigenve
eigenve_PC1= eigenve[order(eigenve[,1]),] 
eigenve_PC2= eigenve[order(eigenve[,2]),]
eigenve_PC3= eigenve[order(eigenve[,3]),]
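
Sorting by signed value places the strongest contributors at both ends of the table. As an alternative sketch, the loadings can be ranked by absolute value so that the largest contributors appear first. The helper name top_contributors is hypothetical, and USArrests again stands in for the odds table.

```r
# Stand-in data: USArrests replaces the scaled odds table of this report.
pca <- princomp(USArrests, cor = TRUE)
load_mat <- unclass(loadings(pca))[, 1:2]

# Hypothetical helper: the n variables with the largest absolute loading.
top_contributors <- function(load_mat, component, n = 3) {
  ord <- order(abs(load_mat[, component]), decreasing = TRUE)
  load_mat[head(ord, n), component]
}

top_pc1 <- top_contributors(load_mat, 1)
print(top_pc1)
```

The sign is kept in the output, so variables that pull the component in opposite directions remain distinguishable.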

For the first principal component:

eigenve_PC1 
##                            Comp.1       Comp.2       Comp.3
## Unibet_dc_12_NA     -0.1984138363 -0.022197593 -0.104743705
## bwin_dc_12_NA       -0.1948326347 -0.019501770 -0.120701815
## Tipico_dc_12_NA     -0.1931446282 -0.013327537 -0.118133074
## SBOBET_dc_12_NA     -0.1908812995 -0.002603716 -0.133059626
## bwin_ou_over_2.5    -0.1869107977 -0.088653035  0.117378853
## bwin_ou_over_0.5    -0.1852193360 -0.079621867  0.087271561
## Tipico_ou_over_2.5  -0.1832875848 -0.096399914  0.114511433
## Unibet_ou_over_2.5  -0.1830633017 -0.093193258  0.121566746
## SBOBET_ou_over_2.5  -0.1810539386 -0.095170575  0.115261967
## Unibet_ou_over_0.5  -0.1796395026 -0.076023545  0.083853895
## 12BET_ou_over_2.5   -0.1788645225 -0.094511553  0.110342009
## bwin_bts_NO_NA      -0.0127075183  0.096708523 -0.308439276
## Unibet_bts_NO_NA    -0.0039049332  0.107065109 -0.293436815
## Tipico_bts_NO_NA    -0.0009964646  0.114802471 -0.291557655
## SBOBET_dc_1X_NA      0.0037954932  0.231182307  0.141032841
## SBOBET_1x2_odd1_NA   0.0091846063  0.226081610  0.158308314
## Tipico_dc_1X_NA      0.0105525820  0.231873057  0.139371506
## bwin_1x2_odd1_NA     0.0106673003  0.226837686  0.158664762
## 12BET_1x2_odd1_NA    0.0109975406  0.225958320  0.157974582
## Tipico_1x2_odd1_NA   0.0114024425  0.226823612  0.159149222
## Unibet_dc_1X_NA      0.0117983149  0.232552075  0.141347411
## Unibet_1x2_odd1_NA   0.0121363600  0.225390907  0.160488929
## bwin_dc_1X_NA        0.0129525304  0.231969597  0.144215691
## Tipico_bts_YES_NA    0.0218455481 -0.123407081  0.284212227
## Unibet_bts_YES_NA    0.0268139428 -0.112551845  0.289654159
## bwin_bts_YES_NA      0.0354979745 -0.098979206  0.301707609
## SBOBET_dc_X2_NA      0.1332668026 -0.200006077 -0.011360277
## bwin_1x2_odd2_NA     0.1349133420 -0.200281161  0.004297161
## Tipico_dc_X2_NA      0.1356670602 -0.196795645  0.001797264
## SBOBET_1x2_odd2_NA   0.1365652421 -0.197538183  0.006060772
## Unibet_dc_X2_NA      0.1367751904 -0.199396547  0.003603614
## bwin_dc_X2_NA        0.1369471809 -0.199355152  0.002117352
## 12BET_1x2_odd2_NA    0.1372687975 -0.197262938  0.007815795
## Unibet_1x2_odd2_NA   0.1373565259 -0.198294425  0.011221393
## Tipico_1x2_odd2_NA   0.1374104964 -0.198752945  0.009558868
## 12BET_ou_under_2.5   0.1845124314  0.089653856 -0.102936724
## SBOBET_ou_under_2.5  0.1857954344  0.090367569 -0.107550927
## Tipico_ou_under_2.5  0.1862331251  0.085539769 -0.104500058
## Unibet_ou_under_0.5  0.1884114064  0.064804071 -0.054665059
## Unibet_ou_under_2.5  0.1897267586  0.082610619 -0.101361722
## bwin_ou_under_2.5    0.1930898700  0.078702822 -0.102095867
## bwin_ou_under_0.5    0.1945358854  0.066253675 -0.071743159
## SBOBET_1x2_oddX_NA   0.1956040364  0.002267699  0.118302168
## 12BET_1x2_oddX_NA    0.1960729227 -0.002862058  0.118135764
## Unibet_1x2_oddX_NA   0.1976659551  0.007883224  0.120073794
## bwin_1x2_oddX_NA     0.1987426605  0.003219621  0.117340512
## Tipico_1x2_oddX_NA   0.1991475284  0.003699527  0.119019099

The Tie odds from Tipico, bwin, Unibet, 12BET and SBOBET have the largest positive loadings on PC1, so they are the highest contributors to its direction. The under 2.5 odds also carry great weight on this component, while the double chance “12” and over 2.5 odds have loadings of similar magnitude with the opposite sign.

Note: Recall that an eigenvector is a direction, such as “vertical” or “45 degrees”, while an eigenvalue is a number telling you how much variance there is in the data in that direction.

For the second principal component eigenvector:

eigenve_PC2
##                            Comp.1       Comp.2       Comp.3
## bwin_1x2_odd2_NA     0.1349133420 -0.200281161  0.004297161
## SBOBET_dc_X2_NA      0.1332668026 -0.200006077 -0.011360277
## Unibet_dc_X2_NA      0.1367751904 -0.199396547  0.003603614
## bwin_dc_X2_NA        0.1369471809 -0.199355152  0.002117352
## Tipico_1x2_odd2_NA   0.1374104964 -0.198752945  0.009558868
## Unibet_1x2_odd2_NA   0.1373565259 -0.198294425  0.011221393
## SBOBET_1x2_odd2_NA   0.1365652421 -0.197538183  0.006060772
## 12BET_1x2_odd2_NA    0.1372687975 -0.197262938  0.007815795
## Tipico_dc_X2_NA      0.1356670602 -0.196795645  0.001797264
## Tipico_bts_YES_NA    0.0218455481 -0.123407081  0.284212227
## Unibet_bts_YES_NA    0.0268139428 -0.112551845  0.289654159
## bwin_bts_YES_NA      0.0354979745 -0.098979206  0.301707609
## Tipico_ou_over_2.5  -0.1832875848 -0.096399914  0.114511433
## SBOBET_ou_over_2.5  -0.1810539386 -0.095170575  0.115261967
## 12BET_ou_over_2.5   -0.1788645225 -0.094511553  0.110342009
## Unibet_ou_over_2.5  -0.1830633017 -0.093193258  0.121566746
## bwin_ou_over_2.5    -0.1869107977 -0.088653035  0.117378853
## bwin_ou_over_0.5    -0.1852193360 -0.079621867  0.087271561
## Unibet_ou_over_0.5  -0.1796395026 -0.076023545  0.083853895
## Unibet_dc_12_NA     -0.1984138363 -0.022197593 -0.104743705
## bwin_dc_12_NA       -0.1948326347 -0.019501770 -0.120701815
## Tipico_dc_12_NA     -0.1931446282 -0.013327537 -0.118133074
## 12BET_1x2_oddX_NA    0.1960729227 -0.002862058  0.118135764
## SBOBET_dc_12_NA     -0.1908812995 -0.002603716 -0.133059626
## SBOBET_1x2_oddX_NA   0.1956040364  0.002267699  0.118302168
## bwin_1x2_oddX_NA     0.1987426605  0.003219621  0.117340512
## Tipico_1x2_oddX_NA   0.1991475284  0.003699527  0.119019099
## Unibet_1x2_oddX_NA   0.1976659551  0.007883224  0.120073794
## Unibet_ou_under_0.5  0.1884114064  0.064804071 -0.054665059
## bwin_ou_under_0.5    0.1945358854  0.066253675 -0.071743159
## bwin_ou_under_2.5    0.1930898700  0.078702822 -0.102095867
## Unibet_ou_under_2.5  0.1897267586  0.082610619 -0.101361722
## Tipico_ou_under_2.5  0.1862331251  0.085539769 -0.104500058
## 12BET_ou_under_2.5   0.1845124314  0.089653856 -0.102936724
## SBOBET_ou_under_2.5  0.1857954344  0.090367569 -0.107550927
## bwin_bts_NO_NA      -0.0127075183  0.096708523 -0.308439276
## Unibet_bts_NO_NA    -0.0039049332  0.107065109 -0.293436815
## Tipico_bts_NO_NA    -0.0009964646  0.114802471 -0.291557655
## Unibet_1x2_odd1_NA   0.0121363600  0.225390907  0.160488929
## 12BET_1x2_odd1_NA    0.0109975406  0.225958320  0.157974582
## SBOBET_1x2_odd1_NA   0.0091846063  0.226081610  0.158308314
## Tipico_1x2_odd1_NA   0.0114024425  0.226823612  0.159149222
## bwin_1x2_odd1_NA     0.0106673003  0.226837686  0.158664762
## SBOBET_dc_1X_NA      0.0037954932  0.231182307  0.141032841
## Tipico_dc_1X_NA      0.0105525820  0.231873057  0.139371506
## bwin_dc_1X_NA        0.0129525304  0.231969597  0.144215691
## Unibet_dc_1X_NA      0.0117983149  0.232552075  0.141347411

Interestingly, the double chance odds for “Home or Tie” (1X) have the highest contribution to the second principal component, whereas the Tie odds have the highest impact on the first. The “Home” win odds come second in influencing the direction of PC2, and the “Away” win and double chance X2 odds load on it with the opposite sign.

For the third principal component eigenvector:

eigenve_PC3
##                            Comp.1       Comp.2       Comp.3
## bwin_bts_NO_NA      -0.0127075183  0.096708523 -0.308439276
## Unibet_bts_NO_NA    -0.0039049332  0.107065109 -0.293436815
## Tipico_bts_NO_NA    -0.0009964646  0.114802471 -0.291557655
## SBOBET_dc_12_NA     -0.1908812995 -0.002603716 -0.133059626
## bwin_dc_12_NA       -0.1948326347 -0.019501770 -0.120701815
## Tipico_dc_12_NA     -0.1931446282 -0.013327537 -0.118133074
## SBOBET_ou_under_2.5  0.1857954344  0.090367569 -0.107550927
## Unibet_dc_12_NA     -0.1984138363 -0.022197593 -0.104743705
## Tipico_ou_under_2.5  0.1862331251  0.085539769 -0.104500058
## 12BET_ou_under_2.5   0.1845124314  0.089653856 -0.102936724
## bwin_ou_under_2.5    0.1930898700  0.078702822 -0.102095867
## Unibet_ou_under_2.5  0.1897267586  0.082610619 -0.101361722
## bwin_ou_under_0.5    0.1945358854  0.066253675 -0.071743159
## Unibet_ou_under_0.5  0.1884114064  0.064804071 -0.054665059
## SBOBET_dc_X2_NA      0.1332668026 -0.200006077 -0.011360277
## Tipico_dc_X2_NA      0.1356670602 -0.196795645  0.001797264
## bwin_dc_X2_NA        0.1369471809 -0.199355152  0.002117352
## Unibet_dc_X2_NA      0.1367751904 -0.199396547  0.003603614
## bwin_1x2_odd2_NA     0.1349133420 -0.200281161  0.004297161
## SBOBET_1x2_odd2_NA   0.1365652421 -0.197538183  0.006060772
## 12BET_1x2_odd2_NA    0.1372687975 -0.197262938  0.007815795
## Tipico_1x2_odd2_NA   0.1374104964 -0.198752945  0.009558868
## Unibet_1x2_odd2_NA   0.1373565259 -0.198294425  0.011221393
## Unibet_ou_over_0.5  -0.1796395026 -0.076023545  0.083853895
## bwin_ou_over_0.5    -0.1852193360 -0.079621867  0.087271561
## 12BET_ou_over_2.5   -0.1788645225 -0.094511553  0.110342009
## Tipico_ou_over_2.5  -0.1832875848 -0.096399914  0.114511433
## SBOBET_ou_over_2.5  -0.1810539386 -0.095170575  0.115261967
## bwin_1x2_oddX_NA     0.1987426605  0.003219621  0.117340512
## bwin_ou_over_2.5    -0.1869107977 -0.088653035  0.117378853
## 12BET_1x2_oddX_NA    0.1960729227 -0.002862058  0.118135764
## SBOBET_1x2_oddX_NA   0.1956040364  0.002267699  0.118302168
## Tipico_1x2_oddX_NA   0.1991475284  0.003699527  0.119019099
## Unibet_1x2_oddX_NA   0.1976659551  0.007883224  0.120073794
## Unibet_ou_over_2.5  -0.1830633017 -0.093193258  0.121566746
## Tipico_dc_1X_NA      0.0105525820  0.231873057  0.139371506
## SBOBET_dc_1X_NA      0.0037954932  0.231182307  0.141032841
## Unibet_dc_1X_NA      0.0117983149  0.232552075  0.141347411
## bwin_dc_1X_NA        0.0129525304  0.231969597  0.144215691
## 12BET_1x2_odd1_NA    0.0109975406  0.225958320  0.157974582
## SBOBET_1x2_odd1_NA   0.0091846063  0.226081610  0.158308314
## bwin_1x2_odd1_NA     0.0106673003  0.226837686  0.158664762
## Tipico_1x2_odd1_NA   0.0114024425  0.226823612  0.159149222
## Unibet_1x2_odd1_NA   0.0121363600  0.225390907  0.160488929
## Tipico_bts_YES_NA    0.0218455481 -0.123407081  0.284212227
## Unibet_bts_YES_NA    0.0268139428 -0.112551845  0.289654159
## bwin_bts_YES_NA      0.0354979745 -0.098979206  0.301707609

Unlike the other two components, in the third principal component the “Both Teams Score” odds have the highest impact on the direction. They are followed by the “Home” win odds.

To see the distribution of games, the PCA scores are plotted in three dimensions, as mentioned before, using the plotly package. Since the main task asks for the distinction between over and under results, the points are color coded by that result. The 3D plot is as follows:

In the interactive 3D plot, there are no clearly distinct differences between the distributions of games with over and under results (marked in red and green respectively). Clicking on the Under legend entry hides those games and shows only the games with over 2.5 goals. Looking at the groups separately and together, the under games are spread over a larger space because of 5 outliers, coded as 178 (coordinates 18, -10, 6), 63, 219, 131 and 547.
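
A plot of this kind can be sketched as follows. The plotting code of the report is not shown, so this is an assumption about how it could be built: the iris data and its Species column stand in for the odds scores and the IsOver label, and the plotly call only runs if the package is installed.

```r
# Stand-in data: iris replaces the odds table, Species replaces IsOver.
pca <- princomp(iris[, 1:4], cor = TRUE)

# First three score columns plus the grouping label used for coloring.
plot_data <- data.frame(pca$scores[, 1:3], group = iris$Species)

if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(plot_data,
                         x = ~Comp.1, y = ~Comp.2, z = ~Comp.3,
                         color = ~group,
                         type = "scatter3d", mode = "markers")
  if (interactive()) print(fig)
}
```

Clicking a legend entry in the resulting widget toggles the visibility of that group.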

To see the distinction between over and under results, we can also use a 2D graph with an ellipse around the space each group covers. Since most of the variance is covered by PC1 and PC2, these two components are used. The plot is as follows:

As the plot shows, there is no significant difference in the space they cover.

Subtask - Code the games according to results (Home, Tie, Away)

Unlike the Over / Under results, the Home / Tie / Away results show clear distinctions, especially between the Away and Home results. In other words, the Home and Away clusters are more obvious in this plot. The Home results lie on the negative side of PC2, while the Away results cover the positive side of PC2.

The 2D plots used to view the clusters can also be applied to the Home / Away / Tie results.

The difference between the clusters is easy to see. Since most of the variance is covered by PC1 and PC2, they are used in the 2D visualization. The directions of the Away and Home clusters are nearly orthogonal, and the Tie cluster sits at their junction.

Task 1.b - 1.c Multidimensional Scaling (MDS) Method

In this method, the Euclidean and Manhattan distance measures are used separately.

To calculate both distances, the MASS package is loaded and a data table for the MDS operations is formed. The matchId column is excluded, since the distance and MDS functions require numerical data. Matches are identified by row number, since all datasets were previously ordered by matchId.

library(MASS)
odds_MDS_wide = copy(odds_final_wide)[,matchId:=NULL] # copy first so that odds_final_wide keeps its matchId column

Euclidean Distance

Before calculating the distances, the data must be scaled, as was done before PCA. The distance matrix is then calculated with the dist function, which is part of base R's stats package.

scaled_odds_MDS_wide=scale(odds_MDS_wide)

distancematrix_eucli=dist(scaled_odds_MDS_wide, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)

To apply MDS, the cmdscale function is called. Since the result is compared with PCA, the dimension of MDS is set to 3.

fit_euc <- cmdscale(distancematrix_eucli,eig=TRUE, k=3) # k is the number of dim

The resulting plot is obtained with the plotly package.

The result is similar to that of PCA. There is no clear distinction between the Over and Under game results. Most of the points (games) have very similar locations, with mostly the same neighbours. The major difference is the overall orientation of the distribution of games: the shape is rotated by around 90 degrees. This is not important, since an MDS configuration is only determined up to rotation and reflection, and the overall shape is what matters.
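
The similarity is not a coincidence: classical MDS on Euclidean distances reproduces the PCA scores up to the sign of each axis. The sketch below checks this on the built-in USArrests data, used here as a stand-in for the odds table.

```r
# Stand-in data: USArrests replaces the scaled odds table of this report.
X <- scale(USArrests)

# Classical MDS on Euclidean distances, two dimensions.
mds_points <- cmdscale(dist(X, method = "euclidean"), k = 2)

# PCA scores on the same standardized data, first two components.
pca_scores <- prcomp(X)$x[, 1:2]

# Compare absolute values, since each axis may be flipped.
max_diff <- max(abs(abs(mds_points) - abs(pca_scores)))
print(max_diff)
```

A difference at the level of floating-point error indicates that the two configurations coincide apart from axis flips, so any visible change in orientation between the plots carries no information.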

Manhattan Distance

As in the Euclidean case, the data must be scaled before calculating the distances. Since the scaled data were already computed for the Euclidean distance, the same table is reused. The same dist function is then called with the method “manhattan” instead of “euclidean”.

distancematrix_manh<-dist(scaled_odds_MDS_wide, method = "manhattan", diag = TRUE, upper = TRUE, p = 2)

As in the Euclidean case, the cmdscale function is called with 3 dimensions to apply MDS.

fit_man <- cmdscale(distancematrix_manh,eig=TRUE, k=3) # k is the number of dim

The resulting plot is obtained with the plotly package.

The result is similar to those of PCA and MDS with Euclidean distance. However, the orientation of the shape is different. Interestingly, the distances between the games are larger in the Manhattan case than in the Euclidean case. The reason is that the Euclidean distance takes the square root of the sum of squared coordinate differences, while the Manhattan distance sums the absolute coordinate differences. The Manhattan distance between two points is therefore never smaller than the Euclidean distance.
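
The relation between the two distances can be checked numerically. This sketch uses random data rather than the odds table; the matrix size and the seed are arbitrary choices.

```r
set.seed(42)
# Random stand-in data: 10 points in 5 dimensions.
X <- matrix(rnorm(10 * 5), nrow = 10, ncol = 5)

d_euc <- dist(X, method = "euclidean")
d_man <- dist(X, method = "manhattan")

# Ratio of Manhattan to Euclidean distance for every pair of points.
ratio <- as.vector(d_man) / as.vector(d_euc)
print(range(ratio))
```

For points with p coordinates, the ratio always lies between 1 and sqrt(p), so with the 47 odds columns of this report the Manhattan distances can be several times larger than the Euclidean ones.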

MDS only maps the points into a lower-dimensional space based on their pairwise distances, whereas PCA also reduces the dimension while covering as much variance as possible, which is handy when dealing with high-dimensional data.

Task 3

Objective

The objective is to perform several operations on an image: displaying its RGB channels, adding random noise, extracting patches, applying PCA to those patches and plotting the images formed by the PCA eigenvectors. By applying PCA, the image will be compressed.

Task 3.1-3.2 Displaying Channels and Basic Image Operations

In order to perform the operations, some packages need to be loaded. The image of a football is read with the readJPEG function.

library(jpeg)
library(imager)
library(DataCombine)
library(dplyr)

myimage = readJPEG("Desktop/football_small.jpg", native = FALSE) 

The initial image is as below:

NOTE: The instructions document states the required image size as 512 x 512 pixels. However, due to computational capacity constraints, the size is reduced to 64 x 64. The code is otherwise the same as for the 512 x 512 case, so with more computational power the same code, with the dimensions adjusted, can be used for that size.

To obtain more information about the structure of the image, we call the dim() function.

dim(myimage)
## [1] 64 64  3

The image is 64 x 64 pixels with 3 color channels.

To display each channel (R, G, B), a separate matrix is formed for each one as follows:

red=myimage[,,1]
green=myimage[,,2]
blue=myimage[,,3]

To display the channels side by side in a single plot, the par() function is used:

par(mfrow=c(1,3))

Rainbow colors are selected to show the intensity levels of each channel of the initial three-channel image. As a quick example, the blue channel’s intensity is high (dark pink) in the sky part of the football image.
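
The plotting code itself is not shown in this report. A minimal sketch of how the three channels could be drawn with base R's image() function follows; the random array standing in for myimage and the helper name orient_for_image are assumptions.

```r
set.seed(1)
# Synthetic stand-in for myimage: 8 x 8 pixels, 3 channels, values in [0, 1].
img <- array(runif(8 * 8 * 3), dim = c(8, 8, 3))

# image() draws matrix rows along the x-axis, so the matrix is flipped and
# transposed to keep row 1 of the image at the top of the plot.
orient_for_image <- function(m) t(m[nrow(m):1, , drop = FALSE])

old_par <- par(mfrow = c(1, 3))
for (k in 1:3) {
  image(orient_for_image(img[, , k]), col = rainbow(256), axes = FALSE,
        main = c("Red", "Green", "Blue")[k])
}
par(old_par)
```

Restoring the previous par() settings afterwards keeps later plots from being squeezed into the same three-panel layout.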

Task 3.3 Adding Random Noise to Image and Displaying Each Channel

To add random noise, a new array with 64 rows, 64 columns and 3 channels is formed and filled with random values drawn uniformly between 0 and 0.1.

random_noise<-array(runif(12288,min=0,max=0.1),dim=(c(64,64,3)))

To obtain the noisy image, the random_noise array is added to the initial football image.

noisy_image <- random_noise + myimage

Since some of the resulting values are greater than 1, while pixel values must be smaller than or equal to 1, min-max scaling is performed.

noisy_image<-(noisy_image - min(noisy_image))/(max(noisy_image)-min(noisy_image))

The noisy football image is plotted below.

To display the RGB channels, we perform the same operations as for the channels of the initial image.

The overall intensities are similar to those of the image without random noise. However, some individual pixels have changed noticeably and no longer resemble their neighbours. Comparing the original channels with the noise-added channels makes the effect of the noise on the image clear.

Task 3.4 Extracting Patches, Perform PCA and Plotting Components

The three-channel color image is converted to a single-channel greyscale image by averaging the channels.

noisy_image_gray = (noisy_image[,,1]+noisy_image[,,2]+noisy_image[,,3])/3 

The greyscale image is as below:

To extract patches, matrix operations are performed within nested for loops. An initial mock data table (dt2) is formed first; each 3 x 3 patch is flattened into a vector, converted to a one-row data table and appended to dt2 with the rbind function.

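dt2 = data.table() # assumed initialization; the original definition of dt2 is not shown
for (i in 2:63){
  for (j in 2:63){
      # flatten the 3x3 patch centered at (i,j) row by row and append it as a new row
      dt2<-rbind(dt2,as.data.table(t(as.vector(t(noisy_image_gray[(i-1):(i+1),(j-1):(j+1)])))))
  }
}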

The data table (dt2) has 9 columns (patch size) and 3,844 rows (62 * 62).

dt2
##              V1        V2        V3        V4        V5        V6
##    1: 0.6220440 0.6535428 0.6725920 0.6734746 0.6321190 0.6552076
##    2: 0.6535428 0.6725920 0.6701251 0.6321190 0.6552076 0.6608269
##    3: 0.6725920 0.6701251 0.6603628 0.6552076 0.6608269 0.6619213
##    4: 0.6701251 0.6603628 0.6700721 0.6608269 0.6619213 0.6786896
##    5: 0.6603628 0.6700721 0.6959735 0.6619213 0.6786896 0.7031005
##   ---                                                            
## 3840: 0.3111300 0.3159344 0.2859936 0.2677921 0.2793485 0.3202214
## 3841: 0.3159344 0.2859936 0.2360640 0.2793485 0.3202214 0.2815143
## 3842: 0.2859936 0.2360640 0.2906766 0.3202214 0.2815143 0.2809586
## 3843: 0.2360640 0.2906766 0.2942121 0.2815143 0.2809586 0.3197644
## 3844: 0.2906766 0.2942121 0.2824384 0.2809586 0.3197644 0.3393211
##              V7        V8        V9
##    1: 0.6737584 0.6697641 0.6702187
##    2: 0.6697641 0.6702187 0.7010549
##    3: 0.6702187 0.7010549 0.7057388
##    4: 0.7010549 0.7057388 0.6945347
##    5: 0.7057388 0.6945347 0.7106554
##   ---                              
## 3840: 0.2774624 0.3193290 0.3384892
## 3841: 0.3193290 0.3384892 0.3005912
## 3842: 0.3384892 0.3005912 0.2945384
## 3843: 0.3005912 0.2945384 0.3388916
## 3844: 0.2945384 0.3388916 0.3724481
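
Growing a table with rbind inside a loop copies it on every iteration, which is one reason the approach becomes slow for larger images. The following is a sketch of an alternative that builds all patches in one pass and returns a plain matrix; the function name extract_patches is hypothetical, and the row order is chosen to match the loops above.

```r
# Hypothetical helper: every size x size patch of a matrix, one patch per row,
# each flattened row by row.
extract_patches <- function(img, size = 3) {
  half <- (size - 1) / 2
  rows <- (1 + half):(nrow(img) - half)
  cols <- (1 + half):(ncol(img) - half)
  # The column index varies fastest, matching an inner loop over j.
  centers <- expand.grid(j = cols, i = rows)
  t(mapply(function(i, j) {
    as.vector(t(img[(i - half):(i + half), (j - half):(j + half)]))
  }, centers$i, centers$j))
}

toy_img <- matrix(1:25, nrow = 5, ncol = 5)
patches <- extract_patches(toy_img)
print(dim(patches))
```

For a 64 x 64 image this yields a 3,844 x 9 matrix, which can be passed to princomp directly or wrapped in as.data.table first.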

After obtaining the patches in a data table, PCA is applied. No separate scaling step is needed, since all pixel values already lie between 0 and 1 and princomp is called with cor = TRUE.

PCA_noisy_image = princomp(dt2,cor = TRUE)
summary(PCA_noisy_image)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3      Comp.4
## Standard deviation     2.8456613 0.65039411 0.50045678 0.294262164
## Proportion of Variance 0.8997542 0.04700139 0.02782855 0.009621136
## Cumulative Proportion  0.8997542 0.94675561 0.97458417 0.984205305
##                             Comp.5      Comp.6     Comp.7      Comp.8
## Standard deviation     0.241172830 0.177339980 0.16503517 0.125840319
## Proportion of Variance 0.006462704 0.003494385 0.00302629 0.001759532
## Cumulative Proportion  0.990668009 0.994162394 0.99718868 0.998948215
##                             Comp.9
## Standard deviation     0.097293690
## Proportion of Variance 0.001051785
## Cumulative Proportion  1.000000000

As the summary table shows, the first two components cover about 95% of the total variance, which is desirable. There is no need to include the third component.

ggbiplot function is used for plotting the PCA results.

As the plot shows, most of the points lie along the PC1 direction, which holds about 90% of the total variance. There is a similar pattern along PC2 as well. There is no strong bias in the points within the principal components.

To reconstruct the image from PC1, PC2 and PC3, the PCA scores are first converted to a data table. (-1)

The scores of each PC are then extracted into their own matrix. (-2)

Since this matrix has one row with 3,844 columns (one score per patch), it is reshaped into a 62 x 62 matrix, filled row by row. (-3)

Finally, min-max scaling is necessary because the image plotting function requires pixel values between 0 and 1. (-4)

dt_new_noisy_image = data.table(PCA_noisy_image$scores) #(-1)

dt_PC1_new_noisy_image = dt_new_noisy_image[,1] #(-2) 
PC1_matrix = as.matrix(t(dt_PC1_new_noisy_image)) #(-2)

PC1_matrix = matrix(PC1_matrix,62,62,byrow=TRUE) #(-3)

PC1_matrix_final<-(PC1_matrix - min(PC1_matrix))/(max(PC1_matrix)-min(PC1_matrix)) #(-4)

The same approach is applied to the other components, and the resulting plots are as below:

Since PC1 covers most of the variance, the image reconstructed from PC1 is similar to the original image. PC2 covers about 5% of the variance, so the shape of the football is noticeable but the details of the surroundings are not. PC3 covers about 3% of the variance, and the overall resolution of the third image is lower.
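
Applying the same steps to every component can be wrapped in a small helper. This is a sketch under stated assumptions: the helper name scores_to_image is hypothetical, and random patches stand in for dt2.

```r
# Hypothetical helper: reshape one score column into an image, filled row by
# row, and rescale it to [0, 1] with min-max scaling.
scores_to_image <- function(scores, n_row, n_col) {
  m <- matrix(scores, n_row, n_col, byrow = TRUE)
  (m - min(m)) / (max(m) - min(m))
}

set.seed(1)
# Random stand-in for dt2: 16 patches of 9 pixels, as from a 6 x 6 image.
patches <- matrix(runif(16 * 9), ncol = 9)
pca <- princomp(patches, cor = TRUE)

# One 4 x 4 image per component for the first three components.
component_images <- lapply(1:3, function(k) {
  scores_to_image(pca$scores[, k], 4, 4)
})
print(dim(component_images[[1]]))
```

With the data of this report, the call would use 62 rows and 62 columns, one pixel per patch center.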

Plotting components 3 by 3 image

To plot the components(eigenvectors) as 3 by 3 image, there is a need to obtain loadings and form a 3 by 3 matrix for each of the three components.

eigenve_noise = loadings(PCA_noisy_image)

PC1_eigenve_noise = eigenve_noise[,1]
PC1_eigenve_noise_m = matrix(PC1_eigenve_noise)
matrix_PC1_eigenve = matrix(PC1_eigenve_noise_m,3,3,byrow=TRUE)

The same procedure is applied to the other components, and the plots are as follows:

The images show the intensities of the loadings, and a pattern biased towards the edges is expected. In the first component, the weight accumulates mainly in the corners. In the second component, it accumulates mainly along the top edge. In the third component, it accumulates mainly along the right edge.

End of the Document